

import sys
import os
from IPython.display import display, HTML  # IPython.core.display is deprecated
display(HTML("<style>.container { width:95% !important; }</style>"))
display(HTML("<style>.output_result { max-width:95% !important; }</style>"))
# this will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# library to help with imputing the mode
from scipy.stats import mode
import scipy.stats as stats
# Library to split data
from sklearn.model_selection import train_test_split
# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries to build tree-based models for prediction
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,  # note: removed in scikit-learn >= 1.2; use ConfusionMatrixDisplay there
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
# to display and store Matplotlib plots within a Python Jupyter notebook
%matplotlib inline
# enable retina display
%config InlineBackend.figure_format='retina'
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
np.set_printoptions(edgeitems=20, linewidth=100)
np.set_printoptions(suppress=True)
pd.set_option("expand_frame_repr", False)
sns.set_style(style="darkgrid")
# read the Tourism.xlsx file
data_path = "/content/sample_data/Tourism.xlsx"
# data_file = "Tourism.xlsx"
data = pd.read_excel(data_path, sheet_name="Tourism")
data
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
| 4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
| 4887 | 204887 | 1 | 36.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Basic | 4.0 | Unmarried | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 24041.0 |
4888 rows × 20 columns
# copying data to another variable to avoid any changes to the original data
df = data.copy() # dataframe for `travel pack` data
df.head()
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
df.tail()
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
| 4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
| 4887 | 204887 | 1 | 36.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Basic | 4.0 | Unmarried | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 24041.0 |
df.shape
(4888, 20)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 4888 non-null int64 1 ProdTaken 4888 non-null int64 2 Age 4662 non-null float64 3 TypeofContact 4863 non-null object 4 CityTier 4888 non-null int64 5 DurationOfPitch 4637 non-null float64 6 Occupation 4888 non-null object 7 Gender 4888 non-null object 8 NumberOfPersonVisiting 4888 non-null int64 9 NumberOfFollowups 4843 non-null float64 10 ProductPitched 4888 non-null object 11 PreferredPropertyStar 4862 non-null float64 12 MaritalStatus 4888 non-null object 13 NumberOfTrips 4748 non-null float64 14 Passport 4888 non-null int64 15 PitchSatisfactionScore 4888 non-null int64 16 OwnCar 4888 non-null int64 17 NumberOfChildrenVisiting 4822 non-null float64 18 Designation 4888 non-null object 19 MonthlyIncome 4655 non-null float64 dtypes: float64(7), int64(7), object(6) memory usage: 763.9+ KB
"object" datatype columns.¶cols_obj = df.select_dtypes(["object"])
cols_obj.columns
Index(['TypeofContact', 'Occupation', 'Gender', 'ProductPitched',
'MaritalStatus', 'Designation'],
dtype='object')
# Checking value counts of categorical variables
for i in cols_obj:
    print(f'Unique values in "{i}" are :')
    df_concat = pd.concat(
        [
            df[i].value_counts().to_frame(),
            round(
                df[i].value_counts(normalize=True).to_frame().rename(columns={i: "%"})
                * 100,
                2,
            ),
        ],
        axis=1,
    )
    print(df_concat)
    print("*" * 50)
Unique values in "TypeofContact" are :
TypeofContact %
Self Enquiry 3444 70.82
Company Invited 1419 29.18
**************************************************
Unique values in "Occupation" are :
Occupation %
Salaried 2368 48.45
Small Business 2084 42.64
Large Business 434 8.88
Free Lancer 2 0.04
**************************************************
Unique values in "Gender" are :
Gender %
Male 2916 59.66
Female 1817 37.17
Fe Male 155 3.17
**************************************************
Unique values in "ProductPitched" are :
ProductPitched %
Basic 1842 37.68
Deluxe 1732 35.43
Standard 742 15.18
Super Deluxe 342 7.00
King 230 4.71
**************************************************
Unique values in "MaritalStatus" are :
MaritalStatus %
Married 2340 47.87
Divorced 950 19.44
Single 916 18.74
Unmarried 682 13.95
**************************************************
Unique values in "Designation" are :
Designation %
Executive 1842 37.68
Manager 1732 35.43
Senior Manager 742 15.18
AVP 342 7.00
VP 230 4.71
**************************************************
Observations
- TypeofContact: the company contacted the customer only ~29% of the time; the most frequent practice (~71%) is the customer contacting the company themselves ('Self Enquiry').
- Occupation: 'Salaried' and 'Small Business' together account for ~91%; 'Large Business' has a low ~9%, and 'Free Lancer' is rare (only 2 observations).
- Gender: has a mistyped third value "Fe Male". We will fix this below.
- ProductPitched: 'Basic' and 'Deluxe' together cover almost 73%; the rest is distributed among 'Standard', 'Super Deluxe', and 'King'.
- MaritalStatus: 'Married' customers dominate at ~48%; 'Divorced', 'Single', and 'Unmarried' together amount to ~52%.
- Designation: the distribution ('Executive' 38%, 'Manager' 35%, 'Senior Manager' 15%, 'AVP' 7%, 'VP' 5%) exactly mirrors that of ProductPitched.
# Removing `CustomerID` variable from the dataset
df.drop(axis=1, columns=["CustomerID"], inplace=True)
# fixing 'Fe Male' typo on 'Gender'
df["Gender"] = df["Gender"].apply(lambda x: "Female" if x == "Fe Male" else x)
# Describing only numerical variables
df.describe(include=[np.int64, np.float64]).T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.0 | 22347.0 | 25571.0 | 98678.0 |
Observations
"ProdTaken" is the dependent variable - type integer. We will convert this to 'categorical'.CityTier, NumberOfPersonVisiting, NumberOfFollowups, PreferredPropertyStar, Passport, PitchSatisfactionScore, OwnCar, and NumberOfChildrenVisiting have a set of discrete values and can be converted to a data type 'categorical'.TypeofContact, Occupation, Gender, ProductPitched, MaritalStatus, and Designation are the other 'categorical' variables.NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, and NumberOfChildrenVisiting are "float64" datatype but they should be "int64" datatype. We will change them.NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, and NumberOfChildrenVisiting to integers.# list of float64 datatype to int64
features_float64 = [
"NumberOfFollowups",
"PreferredPropertyStar",
"NumberOfTrips",
"NumberOfChildrenVisiting",
]
# converting to int64 and printing the number of missing values
for feature in features_float64:
    df[feature] = df[feature].astype(pd.Int64Dtype())
    print(
        f"feature: ['{feature}'] has {df[feature].isnull().sum()} missing (NaN) values."
    )
feature: ['NumberOfFollowups'] has 45 missing (NaN) values. feature: ['PreferredPropertyStar'] has 26 missing (NaN) values. feature: ['NumberOfTrips'] has 140 missing (NaN) values. feature: ['NumberOfChildrenVisiting'] has 66 missing (NaN) values.
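The nullable `Int64` dtype used above is what lets the conversion keep its NaNs; a plain `astype(int)` would raise on missing values. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# float64 column with a missing value (toy data)
s = pd.Series([3.0, 4.0, np.nan])

# pandas' nullable Int64 dtype stores whole numbers alongside <NA>
s_int = s.astype(pd.Int64Dtype())
print(s_int.dtype)           # Int64
print(s_int.isnull().sum())  # 1
```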
# Set of dependent/independent variables to be converted to categorical
col_cat = [
"ProdTaken",
"CityTier",
"NumberOfPersonVisiting",
"NumberOfFollowups",
"PreferredPropertyStar",
"Passport",
"PitchSatisfactionScore",
"OwnCar",
"NumberOfChildrenVisiting",
"TypeofContact",
"Occupation",
"Gender",
"ProductPitched",
"MaritalStatus",
"Designation",
]
# loop to convert the listed variables (dependent and independent) with discrete values to `categorical`
for col in col_cat:
    df[col] = df[col].astype("category")
# Checking value counts of categorical variables
for i in col_cat:
    print(f'Unique values in "{i}" are :')
    df_concat = pd.concat(
        [
            df[i].value_counts().to_frame(),
            round(
                df[i].value_counts(normalize=True).to_frame().rename(columns={i: "%"})
                * 100,
                2,
            ),
        ],
        axis=1,
    )
    print(df_concat)
    print("*" * 50)
Unique values in "ProdTaken" are :
ProdTaken %
0 3968 81.18
1 920 18.82
**************************************************
Unique values in "CityTier" are :
CityTier %
1 3190 65.26
3 1500 30.69
2 198 4.05
**************************************************
Unique values in "NumberOfPersonVisiting" are :
NumberOfPersonVisiting %
3 2402 49.14
2 1418 29.01
4 1026 20.99
1 39 0.80
5 3 0.06
**************************************************
Unique values in "NumberOfFollowups" are :
NumberOfFollowups %
4 2068 42.70
3 1466 30.27
5 768 15.86
2 229 4.73
1 176 3.63
6 136 2.81
**************************************************
Unique values in "PreferredPropertyStar" are :
PreferredPropertyStar %
3 2993 61.56
5 956 19.66
4 913 18.78
**************************************************
Unique values in "Passport" are :
Passport %
0 3466 70.91
1 1422 29.09
**************************************************
Unique values in "PitchSatisfactionScore" are :
PitchSatisfactionScore %
3 1478 30.24
5 970 19.84
1 942 19.27
4 912 18.66
2 586 11.99
**************************************************
Unique values in "OwnCar" are :
OwnCar %
1 3032 62.03
0 1856 37.97
**************************************************
Unique values in "NumberOfChildrenVisiting" are :
NumberOfChildrenVisiting %
1 2080 43.14
2 1335 27.69
0 1082 22.44
3 325 6.74
**************************************************
Unique values in "TypeofContact" are :
TypeofContact %
Self Enquiry 3444 70.82
Company Invited 1419 29.18
**************************************************
Unique values in "Occupation" are :
Occupation %
Salaried 2368 48.45
Small Business 2084 42.64
Large Business 434 8.88
Free Lancer 2 0.04
**************************************************
Unique values in "Gender" are :
Gender %
Male 2916 59.66
Female 1972 40.34
**************************************************
Unique values in "ProductPitched" are :
ProductPitched %
Basic 1842 37.68
Deluxe 1732 35.43
Standard 742 15.18
Super Deluxe 342 7.00
King 230 4.71
**************************************************
Unique values in "MaritalStatus" are :
MaritalStatus %
Married 2340 47.87
Divorced 950 19.44
Single 916 18.74
Unmarried 682 13.95
**************************************************
Unique values in "Designation" are :
Designation %
Executive 1842 37.68
Manager 1732 35.43
Senior Manager 742 15.18
AVP 342 7.00
VP 230 4.71
**************************************************
cat_vars = [
"TypeofContact",
"Occupation",
"Gender",
"ProductPitched",
"MaritalStatus",
"Designation",
]
for feature in cat_vars:
    print(
        f"feature: ['{feature}'] has {df[feature].isnull().sum()} missing (NaN) values."
    )
feature: ['TypeofContact'] has 25 missing (NaN) values. feature: ['Occupation'] has 0 missing (NaN) values. feature: ['Gender'] has 0 missing (NaN) values. feature: ['ProductPitched'] has 0 missing (NaN) values. feature: ['MaritalStatus'] has 0 missing (NaN) values. feature: ['Designation'] has 0 missing (NaN) values.
"TypeofContact" has 25 missing (NaN) values."TypeofContact" with missing values¶# current possible values for "TypeofContact"
df["TypeofContact"].value_counts()
Self Enquiry 3444 Company Invited 1419 Name: TypeofContact, dtype: int64
'mode' of "TypeofContact" for the missing data in the column.# calculating the 'mode' for the feature
mode_TypeofContact = mode(df["TypeofContact"])[0][0]
print(f"The 'mode' for the feature \"TypeofContact\" is '{mode_TypeofContact}'")
df.loc[df["TypeofContact"].isnull(), "TypeofContact"] = mode_TypeofContact
The 'mode' for the feature "TypeofContact" is 'Self Enquiry'
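For reference, the same mode imputation can be done with pandas alone via `Series.mode` and `fillna`, avoiding the scipy dependency; a sketch on toy data mimicking the column:

```python
import numpy as np
import pandas as pd

# toy column mimicking "TypeofContact" with one missing entry (hypothetical data)
contact = pd.Series(["Self Enquiry", "Self Enquiry", "Company Invited", np.nan])

# Series.mode() returns the most frequent value(s); take the first
fill_value = contact.mode()[0]
contact = contact.fillna(fill_value)
print(fill_value)              # Self Enquiry
print(contact.isnull().sum())  # 0
```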
# list of columns to encode
cols_to_encode = [
"TypeofContact",
"Occupation",
"Gender",
"ProductPitched",
"MaritalStatus",
"Designation",
]
# maps for columns to encode
TypeofContact_dict = {"Self Enquiry": 0, "Company Invited": 1}
Occupation_dict = {
"Salaried": 0,
"Small Business": 1,
"Large Business": 2,
"Free Lancer": 3,
}
Gender_dict = {"Male": 0, "Female": 1}
ProductPitched_dict = {
"Basic": 0,
"Deluxe": 1,
"Standard": 2,
"Super Deluxe": 3,
"King": 4,
}
MaritalStatus_dict = {"Married": 0, "Divorced": 1, "Single": 2, "Unmarried": 3}
Designation_dict = {
"Executive": 0,
"Manager": 1,
"Senior Manager": 2,
"AVP": 3,
"VP": 4,
}
# index of dictionaries of columns to encode
enc_list_dicts = {
0: TypeofContact_dict,
1: Occupation_dict,
2: Gender_dict,
3: ProductPitched_dict,
4: MaritalStatus_dict,
5: Designation_dict,
}
# encoding the listed columns
for i, feature in enumerate(cols_to_encode):
    df[feature + "_num"] = df[feature].map(enc_list_dicts[i])
    print(feature, "\n", df[feature + "_num"].value_counts())
    print(80 * "*")
TypeofContact 0 3469 1 1419 Name: TypeofContact_num, dtype: int64 ******************************************************************************** Occupation 0 2368 1 2084 2 434 3 2 Name: Occupation_num, dtype: int64 ******************************************************************************** Gender 0 2916 1 1972 Name: Gender_num, dtype: int64 ******************************************************************************** ProductPitched 0 1842 1 1732 2 742 3 342 4 230 Name: ProductPitched_num, dtype: int64 ******************************************************************************** MaritalStatus 0 2340 1 950 2 916 3 682 Name: MaritalStatus_num, dtype: int64 ******************************************************************************** Designation 0 1842 1 1732 2 742 3 342 4 230 Name: Designation_num, dtype: int64 ********************************************************************************
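Since the `_num` columns will feed the models, it helps to keep the inverse maps around for decoding predictions back to labels. A small sketch (the dictionary mirrors `ProductPitched_dict` defined above):

```python
# encoding map as defined above
ProductPitched_dict = {"Basic": 0, "Deluxe": 1, "Standard": 2, "Super Deluxe": 3, "King": 4}

# invert the map to decode integer codes back into labels
inv_ProductPitched = {v: k for k, v in ProductPitched_dict.items()}

codes = [0, 2, 4]
print([inv_ProductPitched[c] for c in codes])  # ['Basic', 'Standard', 'King']
```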
df.drop(
labels=[
"TypeofContact",
"Occupation",
"Gender",
"ProductPitched",
"MaritalStatus",
"Designation",
],
axis=1,
inplace=True,
)
df
| ProdTaken | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | PreferredPropertyStar | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | MonthlyIncome | TypeofContact_num | Occupation_num | Gender_num | ProductPitched_num | MaritalStatus_num | Designation_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 3 | 6.0 | 3 | 3 | 3 | 1 | 1 | 2 | 1 | 0 | 20993.0 | 0 | 0 | 1 | 1 | 2 | 1 |
| 1 | 0 | 49.0 | 1 | 14.0 | 3 | 4 | 4 | 2 | 0 | 3 | 1 | 2 | 20130.0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 2 | 1 | 37.0 | 1 | 8.0 | 3 | 4 | 3 | 7 | 1 | 3 | 0 | 0 | 17090.0 | 0 | 3 | 0 | 0 | 2 | 0 |
| 3 | 0 | 33.0 | 1 | 9.0 | 2 | 3 | 3 | 2 | 1 | 5 | 1 | 1 | 17909.0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | NaN | 1 | 8.0 | 2 | 3 | 4 | 1 | 0 | 5 | 1 | 0 | 18468.0 | 0 | 1 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4883 | 1 | 49.0 | 3 | 9.0 | 3 | 5 | 4 | 2 | 1 | 1 | 1 | 1 | 26576.0 | 0 | 1 | 0 | 1 | 3 | 1 |
| 4884 | 1 | 28.0 | 1 | 31.0 | 4 | 5 | 3 | 3 | 1 | 3 | 1 | 2 | 21212.0 | 1 | 0 | 0 | 0 | 2 | 0 |
| 4885 | 1 | 52.0 | 3 | 17.0 | 4 | 4 | 4 | 7 | 0 | 1 | 1 | 3 | 31820.0 | 0 | 0 | 1 | 2 | 0 | 2 |
| 4886 | 1 | 19.0 | 3 | 16.0 | 3 | 4 | 3 | 3 | 0 | 5 | 0 | 2 | 20289.0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 4887 | 1 | 36.0 | 1 | 14.0 | 4 | 4 | 4 | 3 | 1 | 3 | 1 | 2 | 24041.0 | 0 | 0 | 0 | 0 | 3 | 0 |
4888 rows × 19 columns
df.isnull().sum()
ProdTaken 0 Age 226 CityTier 0 DurationOfPitch 251 NumberOfPersonVisiting 0 NumberOfFollowups 45 PreferredPropertyStar 26 NumberOfTrips 140 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 66 MonthlyIncome 233 TypeofContact_num 0 Occupation_num 0 Gender_num 0 ProductPitched_num 0 MaritalStatus_num 0 Designation_num 0 dtype: int64
The columns with missing values are: Age, DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, NumberOfChildrenVisiting, and MonthlyIncome.
# list of columns with missing values
miss_cols = [
"Age",
"DurationOfPitch",
"NumberOfFollowups",
"PreferredPropertyStar",
"NumberOfTrips",
"NumberOfChildrenVisiting",
"MonthlyIncome",
]
# list of data types of columns with missing values
miss_cols_dtype = []
for col in miss_cols:
    miss_cols_dtype.append(f"{df[col].dtype}")
# dictionary containing columns with missing values and their data types
miss_cols_dict = dict(zip(miss_cols, miss_cols_dtype))
miss_cols_dict
{'Age': 'float64',
'DurationOfPitch': 'float64',
'MonthlyIncome': 'float64',
'NumberOfChildrenVisiting': 'category',
'NumberOfFollowups': 'category',
'NumberOfTrips': 'Int64',
'PreferredPropertyStar': 'category'}
for k in miss_cols_dict:
    print(k, df[df[k].isnull()].shape[0])
Age 226 DurationOfPitch 251 NumberOfFollowups 45 PreferredPropertyStar 26 NumberOfTrips 140 NumberOfChildrenVisiting 66 MonthlyIncome 233
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.2f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-center of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
def histogram_boxplot(data, feature, figsize=(16, 8), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (16, 8))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a marker indicates the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to plot univariate analysis based on feature data type
def plot_univariate(data, feature):
    """
    Plot univariate analysis based on feature data type
    data: dataframe
    feature: dataframe column
    """
    print(data[feature].dtype)
    if data[feature].dtype in ("category", "Int64", "object"):
        labeled_barplot(data, feature, perc=True)
    else:
        histogram_boxplot(data, feature, kde=True)
# function to find outlier values in a feature
def get_outliers(data, feature, factor=1.5, include_indexes=False):
    """
    Find outliers using the IQR rule
    data: dataframe
    feature: dataframe column
    factor: IQR multiplier for the whiskers (default 1.5)
    include_indexes: also return the index labels of the outliers
    """
    p25, p50, p75 = data[feature].describe().to_numpy()[-4:-1].tolist()
    iqr = p75 - p25
    loww = p25 - iqr * factor  # lower whisker
    uppw = p75 + iqr * factor  # upper whisker
    filt = (data[feature] > uppw) | (data[feature] < loww)
    if include_indexes:
        return data.loc[filt, feature].tolist(), data.loc[filt, feature].index
    return data.loc[filt, feature].tolist(), []
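To sanity-check the IQR rule used by `get_outliers`, here is the same computation spelled out on a toy column (the values are made up):

```python
import pandas as pd

# toy distribution with one clear high outlier
toy = pd.DataFrame({"x": [10, 11, 12, 13, 14, 100]})

p25, p75 = toy["x"].quantile([0.25, 0.75])
iqr = p75 - p25          # 13.75 - 11.25 = 2.5
upper = p75 + 1.5 * iqr  # 17.5: values above this bound are flagged
print(toy.loc[toy["x"] > upper, "x"].tolist())  # [100]
```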
# function to extract some useful statistics from the feature distribution
def get_stats(data, feature):
    """
    Get mean, stdev, median, variance, and mode of a 'feature'
    data: dataframe
    feature: dataframe column
    """
    avg = np.nanmean(data[feature])
    stdev = np.nanstd(data[feature])
    median = np.nanmedian(data[feature])
    var = np.nanvar(data[feature])
    # drop NaNs before counting so a missing value cannot be reported as the mode
    values, counts = np.unique(data[feature].dropna(), return_counts=True)
    mode = values[np.argmax(counts)]
    return avg, stdev, median, var, mode
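The nan-aware NumPy functions in `get_stats` skip missing values instead of propagating them; a quick sketch on toy data:

```python
import numpy as np
import pandas as pd

# toy column with a missing value; nan-aware functions ignore the NaN
toy = pd.Series([1.0, 2.0, 2.0, 3.0, np.nan])
print(np.nanmean(toy))    # 2.0
print(np.nanmedian(toy))  # 2.0

# mode via unique counts (dropping NaN so it cannot win the count)
vals, counts = np.unique(toy.dropna(), return_counts=True)
print(vals[np.argmax(counts)])  # 2.0
```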
'ProdTaken'
plot_univariate(df, "ProdTaken")
category
'Age'
labeled_barplot(df, "Age", perc=True)
plot_univariate(df, "Age")
float64
For Age we have a right-skewed distribution with no visible outliers and a little hump on the left side of the median, a signal of a possible bi-modal distribution.
'CityTier'
plot_univariate(df, "CityTier")
category
For CityTier the most frequent value is tier 1 with 65%, followed by tier 3 at around 31% and tier 2 at a small 4%.
'DurationOfPitch'
labeled_barplot(df, "DurationOfPitch", perc=True)
plot_univariate(df, "DurationOfPitch")
float64
For DurationOfPitch we have a slightly skewed distribution (the outliers make the tail very long). We will remove these outlier observations, as they may be exaggerated values for a sales pitch.
Outliers in 'DurationOfPitch'
# getting outlier values and location indexes of the outliers
outliers, bad_indexes = get_outliers(
df, "DurationOfPitch", factor=2.5, include_indexes=True
)
outliers, bad_indexes
([126.0, 127.0], Int64Index([1434, 3878], dtype='int64'))
These two observations have a 'DurationOfPitch' higher than two hours. We will remove them.
# display rows with 'bad_indexes'
df.loc[bad_indexes]
| ProdTaken | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | PreferredPropertyStar | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | MonthlyIncome | TypeofContact_num | Occupation_num | Gender_num | ProductPitched_num | MaritalStatus_num | Designation_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1434 | 0 | NaN | 3 | 126.0 | 2 | 3 | 3 | 3 | 0 | 1 | 1 | 1 | 18482.0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3878 | 0 | 53.0 | 3 | 127.0 | 3 | 4 | 3 | 4 | 0 | 1 | 1 | 2 | 22160.0 | 1 | 0 | 0 | 0 | 0 | 0 |
# removing highly extreme outliers
df.drop(axis=0, index=bad_indexes, inplace=True)
# get some stats from the "DurationOfPitch" distribution without the presence of "outliers"
dfdata = df.loc[~df.index.isin(bad_indexes)].copy()
avg, std, median, var, mode = get_stats(dfdata, "DurationOfPitch")
avg, std, median, mode
(15.442934196332255, 8.20244992074016, 13.0, 9.0)
To impute the missing values in 'DurationOfPitch' we will utilize the mode of 9 minutes.
'DurationOfPitch' after removing extreme outliers
labeled_barplot(df, "DurationOfPitch", perc=True)
plot_univariate(df, "DurationOfPitch")
float64
'DurationOfPitch' is right-skewed with a second hump around the mean value, indicating a bi-modal distribution.
'NumberOfPersonVisiting'
plot_univariate(df, "NumberOfPersonVisiting")
category
For NumberOfPersonVisiting the most frequent group size is 3 people (49% of the time), followed by groups of 2 (29%) and 4 (21%).
'NumberOfFollowups'
plot_univariate(df, "NumberOfFollowups")
category
For NumberOfFollowups the most frequent count is 4 (~43% of the time), followed by 3 (30%) and 5 (~16%). The remaining counts have low percentages: 2 at ~5%, 1 at ~4%, and 6 at ~3%.
'PreferredPropertyStar'
plot_univariate(df, "PreferredPropertyStar")
category
For PreferredPropertyStar the most frequent category is 3 stars (~62% of the time), followed by 5 stars (~20%) and 4 stars (~19%).
'NumberOfTrips'
plot_univariate(df, "NumberOfTrips")
Int64
For NumberOfTrips the most frequent count is 2 trips (~30% of the time), followed by 3 trips at 22%, 1 trip at ~13%, and 4 through 8 trips at roughly 10%, 10%, 7%, 5%, and 2% respectively.
NumberOfTrips also takes the values 19, 20, 21, and 22, each at 0.02%. We will remove these observations, as they appear to be outliers with a very low frequency of occurrence.
indexes_todrop = df[df["NumberOfTrips"].isin([19, 20, 21, 22])].index
df.drop(axis=0, index=indexes_todrop, inplace=True)
'Passport'
plot_univariate(df, "Passport")
category
For 'Passport', almost 71% of customers don't have one; only 29% do.
'PitchSatisfactionScore'
plot_univariate(df, "PitchSatisfactionScore")
category
On 'PitchSatisfactionScore' the most frequent rating is 3, with a frequency of 30%; ratings 1, 4, and 5 follow at 19%, 19%, and 20% respectively.
The least frequent rating given by customers is 2, at 12%.
'OwnCar'
plot_univariate(df, "OwnCar")
category
For 'OwnCar', 62% of customers own a car and 38% don't.
'NumberOfChildrenVisiting'
plot_univariate(df, "NumberOfChildrenVisiting")
category
For 'NumberOfChildrenVisiting', the most frequent number of children accompanying visitors is 1 (~43% of the time), followed by 2 at ~28% and 0 at ~22%; the least frequent is 3, at ~7%.
'MonthlyIncome'
plot_univariate(df, "MonthlyIncome")
float64
For 'MonthlyIncome', we have an interesting display: the distribution is right-skewed, with several humps side-by-side behind the median and a couple of 'extreme' values far away from it. This calls for an analysis of outliers on this feature.
Outliers in 'MonthlyIncome'
# getting outlier values and location indexes of the outliers
# Note: using a higher `amplitude` for the IQR multiplier of 2.5 instead of the common 1.5
outliers, bad_indexes = get_outliers(
df, "MonthlyIncome", factor=2.5, include_indexes=True
)
# extreme outliers and indexes of their location
outliers, bad_indexes
([95000.0, 1000.0, 98678.0, 4678.0], Int64Index([38, 142, 2482, 2586], dtype='int64'))
# observing the rows with the `bad_indexes` considered as `extreme outliers`
df.loc[
bad_indexes,
[
"Passport",
"MonthlyIncome",
"Designation_num",
"PreferredPropertyStar",
"NumberOfPersonVisiting",
],
]
| Passport | MonthlyIncome | Designation_num | PreferredPropertyStar | NumberOfPersonVisiting | |
|---|---|---|---|---|---|
| 38 | 1 | 95000.0 | 0 | NaN | 2 |
| 142 | 1 | 1000.0 | 1 | 3 | 2 |
| 2482 | 1 | 98678.0 | 0 | 5 | 3 |
| 2586 | 1 | 4678.0 | 1 | 3 | 3 |
The 'MonthlyIncome' of the extreme outliers corresponds to the Executive (Designation_num = 0) and Manager (Designation_num = 1) designations.
These values don't seem plausible for 'MonthlyIncome', so we will remove these observations.
# drop extreme outliers out of the 'MonthlyIncome' column.
df.drop(index=bad_indexes, inplace=True)
Rescaling 'MonthlyIncome' to thousands
df["MonthlyIncome"] = df["MonthlyIncome"] / 1000.00
'MonthlyIncome'
plot_univariate(df, "MonthlyIncome")
float64
Now the feature 'MonthlyIncome' has an interesting display with a right-skewed distribution showing a few humps side-by-side.
The box plot still shows some outliers; they are kept in this case, as they may enrich the dataset insights.
'TypeofContact_num'
plot_univariate(df, "TypeofContact_num")
category
{v: k for k, v in TypeofContact_dict.items()}
{0: 'Self Enquiry', 1: 'Company Invited'}
reversed dictionary: {0: 'Self Enquiry', 1: 'Company Invited'}
On the feature 'TypeofContact_num', the class '0' has a presence of 71% while class '1' has 29%.
'Occupation_num'
plot_univariate(df, "Occupation_num")
category
{v: k for k, v in Occupation_dict.items()}
{0: 'Salaried', 1: 'Small Business', 2: 'Large Business', 3: 'Free Lancer'}
On the feature 'Occupation_num', class '0' ('Salaried') has a presence of 48%, while class '1' ('Small Business') has almost 43%.
Class '2' ('Large Business') has almost 9% frequency, and class '3' ('Free Lancer') will be removed as it has only 0.04% presence and no relevance for this analysis.
'Gender_num'
plot_univariate(df, "Gender_num")
category
{v: k for k, v in Gender_dict.items()}
{0: 'Male', 1: 'Female'}
reversed dictionary: {0: 'Male', 1: 'Female'}
On the feature 'Gender_num', the class = '0'-Male has a presence of 60%, while class = '1'-Female has 40%.
'ProductPitched_num'
plot_univariate(df, "ProductPitched_num")
category
{v: k for k, v in ProductPitched_dict.items()}
{0: 'Basic', 1: 'Deluxe', 2: 'Standard', 3: 'Super Deluxe', 4: 'King'}
reversed dictionary: {0: 'Basic', 1: 'Deluxe', 2: 'Standard', 3: 'Super Deluxe', 4: 'King'}
On the feature 'ProductPitched_num', the class = '0'-Basic has the most presence of 38%, followed by class = '1'-Deluxe with 36%.
Class '2'-Standard has 15%, class '3'-Super Deluxe 7%, and class '4'-King 5%.
'MaritalStatus_num'
plot_univariate(df, "MaritalStatus_num")
category
{v: k for k, v in MaritalStatus_dict.items()}
{0: 'Married', 1: 'Divorced', 2: 'Single', 3: 'Unmarried'}
reversed dictionary: {0: 'Married', 1: 'Divorced', 2: 'Single', 3: 'Unmarried'}
Married represents 48%, followed by Divorced and Single at 19% each, and lastly Unmarried with 14%.
'Designation_num'
plot_univariate(df, "Designation_num")
category
{v: k for k, v in Designation_dict.items()}
{0: 'Executive', 1: 'Manager', 2: 'Senior Manager', 3: 'AVP', 4: 'VP'}
reversed dictionary: {0: 'Executive', 1: 'Manager', 2: 'Senior Manager', 3: 'AVP', 4: 'VP'}
Executive and Manager represent 38% and 35%, respectively, of the customers requesting packages.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4878 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4878 non-null   category
 1   Age                       4653 non-null   float64
 2   CityTier                  4878 non-null   category
 3   DurationOfPitch           4627 non-null   float64
 4   NumberOfPersonVisiting    4878 non-null   category
 5   NumberOfFollowups         4833 non-null   category
 6   PreferredPropertyStar     4853 non-null   category
 7   NumberOfTrips             4738 non-null   Int64
 8   Passport                  4878 non-null   category
 9   PitchSatisfactionScore    4878 non-null   category
 10  OwnCar                    4878 non-null   category
 11  NumberOfChildrenVisiting  4812 non-null   category
 12  MonthlyIncome             4645 non-null   float64
 13  TypeofContact_num         4878 non-null   category
 14  Occupation_num            4878 non-null   category
 15  Gender_num                4878 non-null   category
 16  ProductPitched_num        4878 non-null   category
 17  MaritalStatus_num         4878 non-null   category
 18  Designation_num           4878 non-null   category
dtypes: Int64(1), category(15), float64(3)
memory usage: 398.3 KB
'Age', 'DurationOfPitch', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'NumberOfChildrenVisiting', and 'MonthlyIncome' are the columns with missing data.
We will impute data in those columns using the well-known K-nearest-neighbours (KNN) algorithm.
from sklearn.impute import KNNImputer
# making a safe copy before proceeding with the KNN algorithm
df_safe = df.copy()
# initialize knn imputer
imputer = KNNImputer(n_neighbors=10)
# column names of the dataframe (KNNImputer returns an array without them)
columns = df_safe.columns
df_filled = imputer.fit_transform(df_safe)
# df_filled is now a NumPy array of float64 values
# we will rebuild a DataFrame and re-cast columns to their original datatypes
df_filled = pd.DataFrame(df_filled, columns=columns)
# converting to original data types
for col in df_filled.columns:
if df[col].dtype == "category":
df_filled[col] = (
pd.to_numeric(df_filled[col]).astype(np.int64).astype("category")
)
if df[col].dtype == "float64":
df_filled[col] = df_filled[col].astype("float64")
if df[col].dtype == "Int64":
df_filled[col] = df_filled[col].fillna(0).astype(np.int64, errors="ignore")
# re-casting the (former) category columns to plain int64 for modeling
for col in df_filled.columns:
if df[col].dtype == "category":
df_filled[col] = pd.to_numeric(df_filled[col]).astype(np.int64)
if df[col].dtype == "float64":
df_filled[col] = df_filled[col].astype("float64")
# converting 'Age' and 'DurationOfPitch' to int64
df_filled["Age"] = df_filled["Age"].astype(np.int64)
df_filled["DurationOfPitch"] = df_filled["DurationOfPitch"].astype(np.int64)
df_filled.head()
| | ProdTaken | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | PreferredPropertyStar | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | MonthlyIncome | TypeofContact_num | Occupation_num | Gender_num | ProductPitched_num | MaritalStatus_num | Designation_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41 | 3 | 6 | 3 | 3 | 3 | 1 | 1 | 2 | 1 | 0 | 20.993 | 0 | 0 | 1 | 1 | 2 | 1 |
| 1 | 0 | 49 | 1 | 14 | 3 | 4 | 4 | 2 | 0 | 3 | 1 | 2 | 20.130 | 1 | 0 | 0 | 1 | 1 | 1 |
| 2 | 1 | 37 | 1 | 8 | 3 | 4 | 3 | 7 | 1 | 3 | 0 | 0 | 17.090 | 0 | 3 | 0 | 0 | 2 | 0 |
| 3 | 0 | 33 | 1 | 9 | 2 | 3 | 3 | 2 | 1 | 5 | 1 | 1 | 17.909 | 1 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 29 | 1 | 8 | 2 | 3 | 4 | 1 | 0 | 5 | 1 | 0 | 18.468 | 0 | 1 | 0 | 0 | 1 | 0 |
df_filled.tail()
| | ProdTaken | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | PreferredPropertyStar | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | MonthlyIncome | TypeofContact_num | Occupation_num | Gender_num | ProductPitched_num | MaritalStatus_num | Designation_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4873 | 1 | 49 | 3 | 9 | 3 | 5 | 4 | 2 | 1 | 1 | 1 | 1 | 26.576 | 0 | 1 | 0 | 1 | 3 | 1 |
| 4874 | 1 | 28 | 1 | 31 | 4 | 5 | 3 | 3 | 1 | 3 | 1 | 2 | 21.212 | 1 | 0 | 0 | 0 | 2 | 0 |
| 4875 | 1 | 52 | 3 | 17 | 4 | 4 | 4 | 7 | 0 | 1 | 1 | 3 | 31.820 | 0 | 0 | 1 | 2 | 0 | 2 |
| 4876 | 1 | 19 | 3 | 16 | 3 | 4 | 3 | 3 | 0 | 5 | 0 | 2 | 20.289 | 0 | 1 | 0 | 0 | 2 | 0 |
| 4877 | 1 | 36 | 1 | 14 | 4 | 4 | 4 | 3 | 1 | 3 | 1 | 2 | 24.041 | 0 | 0 | 0 | 0 | 3 | 0 |
# copying imputed dataframe back to the original name
df = df_filled.copy()
df.dtypes
ProdTaken                     int64
Age                           int64
CityTier                      int64
DurationOfPitch               int64
NumberOfPersonVisiting        int64
NumberOfFollowups             int64
PreferredPropertyStar         int64
NumberOfTrips                 int64
Passport                      int64
PitchSatisfactionScore        int64
OwnCar                        int64
NumberOfChildrenVisiting      int64
MonthlyIncome               float64
TypeofContact_num             int64
Occupation_num                int64
Gender_num                    int64
ProductPitched_num            int64
MaritalStatus_num             int64
Designation_num               int64
dtype: object
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
# print(tab1)
# print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
plt.figure(figsize=(16, 10))
sns.heatmap(
df[df.columns[1:]].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap="Spectral",
)
plt.show()
We have to make some decisions before continuing the EDA process while doing Bi-Variate and Multi-Variate Analysis.
The correlation plot above shows a couple of interesting relationships among pairs of features.
We will remove features that show a high correlation with each other, keeping only one of each pair.
Starting with 'NumberOfChildrenVisiting', which has a 60%+ correlation with 'NumberOfPersonVisiting'; we will eliminate the former.
The feature 'MonthlyIncome' is positively correlated at 86% with 'ProductPitched_num' and 'Designation_num'; we will eliminate the latter two.
# drop highly correlated columns as indicated in the section before
cols = ["NumberOfChildrenVisiting", "ProductPitched_num", "Designation_num"]
df.drop(axis=1, inplace=True, columns=cols)
df.columns
Index(['ProdTaken', 'Age', 'CityTier', 'DurationOfPitch',
'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar',
'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar',
'MonthlyIncome', 'TypeofContact_num', 'Occupation_num', 'Gender_num',
'MaritalStatus_num'],
dtype='object')
plt.figure(figsize=(16, 10))
sns.heatmap(
df[df.columns[1:]].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap="Spectral",
)
plt.show()
The highest remaining correlation is 0.49, between 'MonthlyIncome' and 'Age'.
'NumberOfFollowUps' and 'NumberOfPersonVisiting' are correlated up to 0.33.
'MonthlyIncome' and 'NumberOfPersonVisiting' are correlated up to 0.22.
'NumberOfTrips' and 'NumberOfPersonVisiting' are correlated at 0.19.
'NumberOfTrips' and 'NumberOfFollowUps' are correlated at 0.14.
'MonthlyIncome' and 'NumberOfTrips' are correlated at 0.13.
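The pairwise correlations read off the heatmap can also be listed programmatically. A small sketch; `top_correlations` is a hypothetical helper, demonstrated here on synthetic data rather than this dataset:

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=5):
    """List the n strongest absolute pairwise correlations (each pair once)."""
    corr = frame.corr().abs()
    # keep only the upper triangle, dropping the diagonal and duplicate pairs
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.unstack().dropna().sort_values(ascending=False).head(n)

# demo on synthetic columns: 'a' and 'b' are nearly identical, 'c' is independent
rng = np.random.default_rng(1)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(top_correlations(demo, n=2))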
stacked_barplot(df, "Age", "ProdTaken")
stacked_barplot(df, "CityTier", "ProdTaken")
Possibility of a package being taken is slightly lower on 'CityTier' = 1, while on the other tiers the chance is around 25%.
stacked_barplot(df, "DurationOfPitch", "ProdTaken")
stacked_barplot(df, "NumberOfPersonVisiting", "ProdTaken")
stacked_barplot(df, "NumberOfFollowups", "ProdTaken")
stacked_barplot(df, "PreferredPropertyStar", "ProdTaken")
stacked_barplot(df, "NumberOfTrips", "ProdTaken")
stacked_barplot(df, "Passport", "ProdTaken")
stacked_barplot(df, "PitchSatisfactionScore", "ProdTaken")
stacked_barplot(df, "OwnCar", "ProdTaken")
# stacked_barplot(df, "MonthlyIncome", "ProdTaken")
plt.figure(figsize=(10, 5))
sns.boxplot(x="ProdTaken", y="MonthlyIncome", data=df, showfliers=False)
plt.show()
The median of 'MonthlyIncome' is lower when a customer buys a package.
stacked_barplot(df, "TypeofContact_num", "ProdTaken")
The chance of getting a package is not much different whether 'TypeOfContact' is 1 or 0.
stacked_barplot(df, "Occupation_num", "ProdTaken")
{v: k for k, v in Occupation_dict.items()}
{0: 'Salaried', 1: 'Small Business', 2: 'Large Business', 3: 'Free Lancer'}
stacked_barplot(df, "Gender_num", "ProdTaken")
{v: k for k, v in Gender_dict.items()}
{0: 'Male', 1: 'Female'}
stacked_barplot(df, "MaritalStatus_num", "ProdTaken")
{v: k for k, v in MaritalStatus_dict.items()}
{0: 'Married', 1: 'Divorced', 2: 'Single', 3: 'Unmarried'}
The chance of getting a package is highest when 'MaritalStatus' is 'Single', followed by 'Unmarried'.
The chance is similar for 'Married' and 'Divorced'.
'Multivariate Analysis' is used to study more complex sets of data than 'Univariate Analysis' and 'Bivariate Analysis' methods can handle. We will analyze features against the target 'ProdTaken'.
sns.catplot(
x="ProdTaken", y="MonthlyIncome", data=df, kind="bar", hue="NumberOfFollowups"
)
plt.xticks()
plt.show()
It seems the number of follow-ups is not a deciding factor in buying a package.
In cases where a package is not bought (ProdTaken = 0), higher salaries combined with 2, 4, 5, or 6 follow-ups are not a deciding factor.
However, when a package is bought, 2 and 6 follow-ups always coincide with a purchase.
In essence, there may not be a connection between buying a package and the number of follow-ups, regardless of a higher salary.
sns.catplot(
x="ProdTaken", y="MonthlyIncome", data=df, kind="bar", hue="NumberOfPersonVisiting"
)
plt.xticks()
plt.show()
Packages are taken only when the number of visiting persons is 2, 3, or 4.
Although customers with the same number of people visiting also decline, this may indicate another 'deciding' variable is at play.
sns.catplot(x="ProdTaken", y="MonthlyIncome", data=df, kind="bar", hue="NumberOfTrips")
plt.xticks()
plt.show()
Gender was cleaned by fixing the 'Fe male' typo.
We have eliminated the 'CustomerID' as it is not required.
In Feature Engineering we encoded the categorical variables and made them ready for model building.
'Age', 'DurationOfPitch', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'NumberOfChildrenVisiting', and 'MonthlyIncome' are the columns with missing data.
We performed imputation in those columns using the well-known K-nearest-neighbours (KNN) algorithm.
A little over 81% of the customers have not applied for a package yet.
On Age we have a right skewed distribution with no visible outliers and a 'little' hump on the left side of the median as a signal of a possible bi-modal distribution.
On CityTier the dominant most frequent is tier 1 with 65% followed by tier 3 with around 31% and a small percentage on tier 2 at 4%.
On DurationOfPitch we have a slightly skewed distribution (the outliers make the tail very long). We will remove these outlier observations as they may be exaggerated values for a sales pitch.
There are two observations with 'DurationOfPitch' longer than two hours; we will remove them.
We have removed observations containing highly extreme values.
For the missing data in the column 'DurationOfPitch' we will utilize the mode of 9 minutes.
The distribution of 'DurationOfPitch' is right-skewed with a second hump around the mean value, indicating a bi-modal distribution.
On NumberOfPersonVisiting the most frequent number of visitors in the group is 3 people with 49% of the time, followed by a group of 2 people with 29% frequency and a group of 4 people with 21% of the time.
On NumberOfFollowups the most frequent number of follow ups is 4 with ~42% of the time, followed by 3-times with 30% and 5-times with ~16%. The rest of NumberOfFollowups with very low percentage are 1@4%, 2@5%, and 6@3%.
On PreferredPropertyStar the most frequent category is 3-stars with 61% of the time, followed by 5-stars with ~20% and 4-stars with 19%.
On NumberOfTrips the most frequent number is 2-trips with ~30% of the time, followed by 3-trips @ 22%, 1-trip @ ~13%, and [4, 5, 6, 7, and 8]-trips @ 10%, 10%, 7%, 5%, and 2% respectively.
However, we observe NumberOfTrips values of 19, 20, 21, and 22, each at 0.02%. We will remove these observations as they represent outliers with a very low presence.
On 'Passport', almost 71% of customers don't have one, and only 29% do.
On 'PitchSatisfactionScore' the most frequent rating is 3 with a frequency of 30%. Then, the frequency for ratings 1, 4, and 5 is 19%, 19% and 20% respectively.
The least frequent rating given by customers is 2, at 12%.
For the feature 'OwnCar', 62% own a car and 38% don't.
For the feature 'NumberOfChildrenVisiting', the most frequent number of children accompanying visitors is 1 at ~43%, followed by 2 at 27% and 0 at 22%; the least frequent is 3 at ~7%.
For the feature 'MonthlyIncome', the distribution of values is right-skewed, showing several humps side-by-side.
The box plot shows a few significant outliers behind the median and a couple of 'extreme' values far away from the median. It is a candidate for Outliers Analysis.
The 'MonthlyIncome' of the extreme outliers correspond to Executive (Designation_num = 0) and Manager (Designation_num = 1).
These values don't look like plausible figures for 'MonthlyIncome', so we will remove these observations.
Now the feature 'MonthlyIncome' has an interesting display with a right-skewed distribution showing a few humps side-by-side.
The box plot shows some outliers that are permitted for this case as they may enrich the dataset insights and therefore they will be kept.
On the feature 'Occupation_num', the class = '0' has a presence of 48% corresponding to 'Salaried' while class = '1' has almost 43% corresponding to 'Small Business'.
The class = '2'-'Large Business' has almost 9% frequency and the class = '3'-'Free Lancer' will be removed as it has only 0.04% presence and no relevance for this analysis.
On the feature 'Gender_num', the class = '0'-Male has a presence of 60%, while class = '1'-Female has 40%.
On the feature 'ProductPitched_num', the class = '0'-Basic has the most presence of 38%, followed by class = '1'-Deluxe with 36%.
Class '2'-Standard has 15%, class '3'-Super Deluxe 7%, and class '4'-King 5%.
Married represents 48%, followed by Divorced and Single at 19% each, and lastly Unmarried with 14%.
Executive and Manager represent 38% and 35%, respectively, of the customers requesting packages.
We have to make some decisions before continuing the EDA process while doing Bi-Variate and Multi-Variate Analysis.
The correlation plot above shows a couple of interesting relationships among pairs of features.
We will remove features that show a high correlation with each other, keeping only one of each pair.
Starting with 'NumberOfChildrenVisiting', which has a 60%+ correlation with 'NumberOfPersonVisiting'; we will eliminate the former.
The feature 'MonthlyIncome' is positively correlated at 86% with 'ProductPitched_num' and 'Designation_num'; we will eliminate the latter two.
dropped highly correlated columns: cols = ["NumberOfChildrenVisiting", "ProductPitched_num", "Designation_num"]
Possibility of a package being taken is reduced as the customer ages.
Possibility of a package being taken is slightly lower on 'CityTier' = 1, while on the other tiers the chance is around 25%.
There is a high variability in the length of the pitch compared to the customer buying a package or not.
The chance of buying a package is higher when 'NumberOfPersonVisiting' is 2, 3, or 4.
The chance of buying a package is higher than 20% when 'NumberOfFollowups' is 5, or 6.
The chance of buying a package is higher than 20% when 'PreferredPropertyStar' is 4, or 5.
The chance of buying a package is quite variable and may not depend heavily on the 'NumberOfTrips' made. However, 'NumberOfTrips' equal to 7 or 8 looks more promising than others.
The chance of buying a package is much higher when customer has a 'Passport'.
When the 'PitchSatisfactionScore' given is 3, or 5 the chance of buying a package is much higher.
There is no significant difference in the chance of buying a package whether the customer owns a car or not.
The median of 'MonthlyIncome' is lower in the case a customer is buying a package.
The chance of getting a package is not much different whether 'TypeOfContact' is 1 or 0.
Chances of getting a package are very high when 'Occupation' is 'Free Lancer' (though 'Free Lancer' is rarely present) and almost 30% when 'Occupation' is 'Large Business'.
The chance of getting a package is almost the same for 'Male' and 'Female', though slightly higher for 'Male'.
The chance of getting a package is highest when 'MaritalStatus' is 'Single', followed by 'Unmarried'.
The chance is similar for 'Married' and 'Divorced'.
It seems the number of follow-ups is not a deciding factor in buying a package.
In cases where a package is not bought (ProdTaken = 0), higher salaries combined with 2, 4, 5, or 6 follow-ups are not a deciding factor.
However, when a package is bought, 2 and 6 follow-ups always coincide with a purchase.
In essence, there may not be a connection between buying a package and the number of follow-ups, regardless of a higher salary.
Packages are taken only when the number of visiting persons is 2, 3, or 4.
Although customers with the same number of people visiting also decline, this may indicate another 'deciding' variable is at play.
Although we have customers buying or not buying with the same number of trips, there is a higher variance when the customer decides to buy a package.
The highest remaining correlation is 0.49, between 'MonthlyIncome' and 'Age'.
'NumberOfFollowUps' and 'NumberOfPersonVisiting' are correlated up to 0.33.
'MonthlyIncome' and 'NumberOfPersonVisiting' are correlated up to 0.22.
'NumberOfTrips' and 'NumberOfPersonVisiting' are correlated at 0.19.
'NumberOfTrips' and 'NumberOfFollowUps' are correlated at 0.14.
'MonthlyIncome' and 'NumberOfTrips' are correlated at 0.13.
We set the stratify parameter to the target variable in the train_test_split function to preserve the class distribution in both splits.
X = df.drop("ProdTaken", axis=1)
y = df.pop("ProdTaken")
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(3414, 15) (1464, 15)
y.value_counts(True)
0    0.811808
1    0.188192
Name: ProdTaken, dtype: float64
y_test.value_counts(True)
0    0.811475
1    0.188525
Name: ProdTaken, dtype: float64
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
"""
model : classifier to predict values of X
y_actual : ground truth
labels : class labels for the confusion matrix rows/columns
"""
y_predict = model.predict(X_test)
cm = confusion_matrix(y_actual, y_predict, labels=labels)
df_cm = pd.DataFrame(
cm,
index=[i for i in ["Actual - No", "Actual - Yes"]],
columns=[i for i in ["Predicted - No", "Predicted - Yes"]],
)
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)
plt.figure(figsize=(10, 7))
sns.heatmap(df_cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
"""
model : classifier to predict values of X
"""
# defining an empty list to store train and test results
score_list = []
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
train_recall = recall_score(y_train, pred_train)
test_recall = recall_score(y_test, pred_test)
train_precision = precision_score(y_train, pred_train)
test_precision = precision_score(y_test, pred_test)
score_list.extend(
(
train_acc,
test_acc,
train_recall,
test_recall,
train_precision,
test_precision,
)
)
# If the flag is set to True then the following print statements will be displayed. The default value is True.
if flag == True:
print("Accuracy on training set : ", model.score(X_train, y_train))
print("Accuracy on test set : ", model.score(X_test, y_test))
print("Recall on training set : ", recall_score(y_train, pred_train))
print("Recall on test set : ", recall_score(y_test, pred_test))
print("Precision on training set : ", precision_score(y_train, pred_train))
print("Precision on test set : ", precision_score(y_test, pred_test))
return score_list # returning the list with train and test scores
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
We will build bagging models here - 'Bagging Classifier', 'Decision Tree Classifier' and 'Random Forest Classifier'.
We will evaluate Accuracy, Precision and Recall, but the metric of interest here is Recall.
Recall - it gives the ratio of True Positives to Actual Positives, so high Recall implies low false negatives, i.e. low chances of predicting a 'buyer' of a package as a 'non-buyer'.
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train, y_train)
BaggingClassifier(random_state=1)
confusion_matrix_sklearn(bagging, X_test, y_test)
bagging_model_train_perf = model_performance_classification_sklearn(
bagging, X_train, y_train
)
print("Training performance \n", bagging_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.99297 0.962617 1.0 0.980952
bagging_model_test_perf = model_performance_classification_sklearn(
bagging, X_test, y_test
)
print("Testing performance \n", bagging_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.894809 0.528986 0.858824 0.654709
Bagging Classifier with weighted decision tree
bagging_wt = BaggingClassifier(
base_estimator=DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.81, 1: 0.19}, random_state=1
),
random_state=1,
)
bagging_wt.fit(X_train, y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.81,
1: 0.19},
random_state=1),
random_state=1)
confusion_matrix_sklearn(bagging_wt, X_test, y_test)
bagging_wt_model_train_perf = model_performance_classification_sklearn(
bagging_wt, X_train, y_train
)
print("Training performance \n", bagging_wt_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.994142 0.971963 0.996805 0.984227
bagging_wt_model_test_perf = model_performance_classification_sklearn(
bagging_wt, X_test, y_test
)
print("Testing performance \n", bagging_wt_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.901639 0.615942 0.817308 0.702479
If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward it.
In this case, we can pass a dictionary {0:0.81,1:0.19} to the model to specify the weight of each class and the decision tree will give more weightage to class 0.
class_weight is a hyperparameter for the decision tree classifier.
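For comparison, class weights can also be derived from the class frequencies instead of being hard-coded. Note that sklearn's 'balanced' option upweights the minority class, the opposite of the {0: 0.81, 1: 0.19} dictionary used here. A brief sketch with a synthetic 81/19 label vector:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# synthetic labels mimicking the ~81%/19% split of 'ProdTaken'
y_demo = np.array([0] * 81 + [1] * 19)

# 'balanced' weight for each class = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights)))  # class 1 (minority) gets the larger weight
```

The resulting dictionary could be passed to `class_weight` directly, which penalizes mistakes on the rarer 'buyer' class more heavily and tends to improve Recall.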
dtree = DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.81, 1: 0.19}, random_state=1
)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.81, 1: 0.19}, random_state=1)
confusion_matrix_sklearn(dtree, X_test, y_test)
Confusion Matrix -
Customer bought a package and the model predicted it correctly : True Positive (observed=1, predicted=1)
Customer didn't buy a package and the model predicted a purchase : False Positive (observed=0, predicted=1)
Customer didn't buy a package and the model predicted no purchase : True Negative (observed=0, predicted=0)
Customer bought a package and the model predicted they wouldn't : False Negative (observed=1, predicted=0)
dtree_model_train_perf = model_performance_classification_sklearn(
dtree, X_train, y_train
)
print("Training performance \n", dtree_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
dtree_model_test_perf = model_performance_classification_sklearn(dtree, X_test, y_test)
print("Testing performance \n", dtree_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.889344 0.724638 0.699301 0.711744
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
confusion_matrix_sklearn(rf, X_test, y_test)
rf_model_train_perf = model_performance_classification_sklearn(rf, X_train, y_train)
print("Training performance \n", rf_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_model_test_perf = model_performance_classification_sklearn(rf, X_test, y_test)
print("Testing performance \n", rf_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.910519 0.572464 0.923977 0.706935
Random forest with class weights
rf_wt = RandomForestClassifier(class_weight={0: 0.81, 1: 0.19}, random_state=1)
rf_wt.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.81, 1: 0.19}, random_state=1)
confusion_matrix_sklearn(rf_wt, X_test, y_test)
rf_wt_model_train_perf = model_performance_classification_sklearn(
rf_wt, X_train, y_train
)
print("Training performance \n", rf_wt_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_wt_model_test_perf = model_performance_classification_sklearn(rf_wt, X_test, y_test)
print("Testing performance \n", rf_wt_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.913934 0.597826 0.916667 0.723684
# grid search for bagging classifier
cl1 = DecisionTreeClassifier(random_state=1)
param_grid = {
"base_estimator": [cl1],
"n_estimators": [5, 7, 15, 51, 101],
"max_features": [0.7, 0.8, 0.9, 1],
}
grid = GridSearchCV(
BaggingClassifier(random_state=1, bootstrap=True),
param_grid=param_grid,
scoring="recall",
cv=5,
)
grid.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1),
param_grid={'base_estimator': [DecisionTreeClassifier(random_state=1)],
'max_features': [0.7, 0.8, 0.9, 1],
'n_estimators': [5, 7, 15, 51, 101]},
scoring='recall')
## getting the best estimator
bagging_estimator = grid.best_estimator_
bagging_estimator.fit(X_train, y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
max_features=0.9, n_estimators=101, random_state=1)
confusion_matrix_sklearn(bagging_estimator, X_test, y_test)
bagging_estimator_model_train_perf = model_performance_classification_sklearn(
bagging_estimator, X_train, y_train
)
print("Training performance \n", bagging_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
bagging_estimator_model_test_perf = model_performance_classification_sklearn(
bagging_estimator, X_test, y_test
)
print("Testing performance \n", bagging_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.914617 0.630435 0.883249 0.735729
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 30),
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_leaf_nodes": [2, 3, 5, 10, 15, None],
"min_impurity_decrease": [0.0001, 0.001, 0.01, 0.1],
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=19, min_impurity_decrease=0.0001,
random_state=1)
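The tuning pattern used throughout this notebook — define a grid, wrap the estimator in `GridSearchCV` with a recall scorer, keep the refit best estimator — can be sketched end-to-end on synthetic data; the toy dataset and small grid below are illustrative, not the project's:

```python
# Sketch of the grid-search pattern on synthetic data (illustrative values)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, recall_score

X_toy, y_toy = make_classification(n_samples=300, random_state=1)
params = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}
gs = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    params,
    scoring=make_scorer(recall_score),  # compare candidates on recall
    cv=5,
)
gs.fit(X_toy, y_toy)
best = gs.best_estimator_            # already refit on the full training data
print(gs.best_params_)               # winning combination
print(round(gs.best_score_, 3))      # its mean cross-validated recall
```

Because `refit=True` by default, `best_estimator_` is already fitted; the extra `fit` calls in the cells above are harmless but redundant.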
confusion_matrix_sklearn(dtree_estimator, X_test, y_test)
dtree_estimator_model_train_perf = model_performance_classification_sklearn(
dtree_estimator, X_train, y_train
)
print("Training performance \n", dtree_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.988576 0.943925 0.995074 0.968825
dtree_estimator_model_test_perf = model_performance_classification_sklearn(
dtree_estimator, X_test, y_test
)
print("Testing performance \n", dtree_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.881148 0.648551 0.699219 0.672932
# Choose the type of classifier.
rf_estimator = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [110, 251, 501],
"min_samples_leaf": np.arange(1, 6, 1),
"max_features": [0.7, 0.9, "log2", "auto"],
"max_samples": [0.7, 0.9, None],
}

# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring="recall", cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(max_features=0.9, n_estimators=501, random_state=1)
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
rf_estimator_model_train_perf = model_performance_classification_sklearn(
rf_estimator, X_train, y_train
)
print("Training performance \n", rf_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_estimator_model_test_perf = model_performance_classification_sklearn(
rf_estimator, X_test, y_test
)
print("Testing performance \n", rf_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.919399 0.648551 0.895 0.752101
Next, we build the boosting models - AdaBoost Classifier, Gradient Boosting Classifier, and XGBoost Classifier. We will track Accuracy, Precision, Recall, and F1 score, but the metric of interest here is Recall. Recall gives the ratio of true positives to actual positives, so a high Recall implies low false negatives, i.e. a low chance of predicting a 'buyer' of a package as a 'non-buyer'.
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train, y_train)
AdaBoostClassifier(random_state=1)
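To make the Recall definition above concrete, here is a minimal sketch (hand-made labels, not project data) computing it directly from a confusion matrix:

```python
# Sketch (hand-made labels): Recall = TP / (TP + FN), straight from the confusion matrix
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # 4 actual 'buyers'
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # one buyer missed (FN), one non-buyer flagged (FP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_recall = tp / (tp + fn)
print(manual_recall)                 # -> 0.75
print(recall_score(y_true, y_pred))  # matches sklearn
```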
# Using above defined function to get accuracy, recall and precision on train and test set
abc_score = get_metrics_score(abc)
Accuracy on training set : 0.8456356180433509
Accuracy on test set : 0.8360655737704918
Recall on training set : 0.3442367601246106
Recall on test set : 0.322463768115942
Precision on training set : 0.6758409785932722
Precision on test set : 0.6267605633802817
# Plot the confusion matrix
make_confusion_matrix(abc, y_test)
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train, y_train)
GradientBoostingClassifier(random_state=1)
# Using above defined function to get accuracy, recall and precision on train and test set
gbc_score = get_metrics_score(gbc)
Accuracy on training set : 0.8863503222026948
Accuracy on test set : 0.8545081967213115
Recall on training set : 0.4672897196261682
Recall on test set : 0.3695652173913043
Precision on training set : 0.8670520231213873
Precision on test set : 0.723404255319149
# Plot the confusion matrix
make_confusion_matrix(gbc, y_test)
xgb = XGBClassifier(random_state=1, eval_metric="logloss")
xgb.fit(X_train, y_train)
XGBClassifier(eval_metric='logloss', random_state=1)
# Using above defined function to get accuracy, recall and precision on train and test set
xgb_score = get_metrics_score(xgb)
Accuracy on training set : 0.880199179847686
Accuracy on test set : 0.85724043715847
Recall on training set : 0.43457943925233644
Recall on test set : 0.35507246376811596
Precision on training set : 0.8584615384615385
Precision on test set : 0.7596899224806202
make_confusion_matrix(xgb, y_test)
With default parameters, all three boosting models show low Recall on both training and test sets. Let's tune them, starting with the AdaBoost classifier.
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
# Let's try different max_depth for base_estimator
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
"n_estimators": np.arange(10, 110, 10),
"learning_rate": np.arange(0.1, 2, 0.1),
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=1.6, n_estimators=100, random_state=1)
# Using above defined function to get accuracy, recall and precision on train and test set
abc_tuned_score = get_metrics_score(abc_tuned)
Accuracy on training set : 0.986233157586409
Accuracy on test set : 0.8442622950819673
Recall on training set : 0.942367601246106
Recall on test set : 0.5434782608695652
Precision on training set : 0.983739837398374
Precision on test set : 0.5952380952380952
make_confusion_matrix(abc_tuned, y_test)
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
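Rather than reading the ranking off the bar chart, the top features can also be listed programmatically; a sketch on synthetic data with hypothetical feature names:

```python
# Sketch (toy data, hypothetical feature names): top-k features by importance
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=1
)
model = AdaBoostClassifier(random_state=1).fit(X_toy, y_toy)
names = [f"feat_{i}" for i in range(X_toy.shape[1])]  # hypothetical names
top3 = pd.Series(model.feature_importances_, index=names).nlargest(3)
print(top3)  # the three largest importances, labeled by feature name
```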
'MonthlyIncome' is the most important feature as per the tuned AdaBoost model.

Let's try using the AdaBoost classifier as the `init` estimator, which provides the initial predictions for the gradient boosting model.
gbc_init = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1), random_state=1
)
gbc_init.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
random_state=1)
# Using above defined function to get accuracy, recall and precision on train and test set
gbc_init_score = get_metrics_score(gbc_init)
Accuracy on training set : 0.8866432337434095
Accuracy on test set : 0.8545081967213115
Recall on training set : 0.46417445482866043
Recall on test set : 0.36594202898550726
Precision on training set : 0.873900293255132
Precision on test set : 0.7266187050359713
As compared to the gradient boosting model with default parameters, using AdaBoost for the initial predictions gives nearly identical performance. Let's tune this model.
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1), random_state=1
)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100, 150, 200, 250],
"subsample": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.8, n_estimators=250, random_state=1,
subsample=0.9)
# Using above defined function to get accuracy, recall and precision on train and test set
gbc_tuned_score = get_metrics_score(gbc_tuned)
Accuracy on training set : 0.9212067955477445
Accuracy on test set : 0.869535519125683
Recall on training set : 0.618380062305296
Recall on test set : 0.44565217391304346
Precision on training set : 0.9429928741092637
Precision on test set : 0.7639751552795031
make_confusion_matrix(gbc_tuned, y_test)
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
'MonthlyIncome' is the most important feature, followed by 'Age', 'Passport', and 'DurationOfPitch', as per the tuned gradient boosting model.

XGBoost has many hyperparameters that can be tuned to improve model performance; you can read about them in the XGBoost documentation. Some of the important ones, used in the grid below, are 'n_estimators', 'scale_pos_weight', 'subsample', 'learning_rate', 'gamma', 'colsample_bytree', and 'colsample_bylevel'.
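Among these, `scale_pos_weight` is worth a note: it reweights the positive class, and a common heuristic (from the XGBoost docs) is the ratio of negative to positive samples. A quick sketch with made-up class counts:

```python
# Sketch (made-up class counts): the usual scale_pos_weight heuristic
import numpy as np

y_toy = np.array([0] * 80 + [1] * 20)          # imbalanced toy labels
spw = (y_toy == 0).sum() / (y_toy == 1).sum()  # negatives / positives
print(spw)  # -> 4.0
```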
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")
# Grid of parameters to choose from
parameters = {
"n_estimators": np.arange(10, 100, 20),
"scale_pos_weight": [0, 1, 2, 5],
"subsample": [0.5, 0.7, 0.9, 1],
"learning_rate": [0.01, 0.1, 0.2, 0.05],
"gamma": [0, 1, 3],
"colsample_bytree": [0.5, 0.7, 0.9, 1],
"colsample_bylevel": [0.5, 0.7, 0.9, 1],
}
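Before running it, it is worth counting how expensive this grid is; `RandomizedSearchCV` (not used here) is a common way to cap the number of fits when a grid grows this large. A quick count:

```python
# Sketch: counting the exhaustive-search cost of the XGBoost grid above
import numpy as np

grid = {
    "n_estimators": list(np.arange(10, 100, 20)),   # 5 values
    "scale_pos_weight": [0, 1, 2, 5],               # 4
    "subsample": [0.5, 0.7, 0.9, 1],                # 4
    "learning_rate": [0.01, 0.1, 0.2, 0.05],        # 4
    "gamma": [0, 1, 3],                             # 3
    "colsample_bytree": [0.5, 0.7, 0.9, 1],         # 4
    "colsample_bylevel": [0.5, 0.7, 0.9, 1],        # 4
}
n_candidates = 1
for values in grid.values():
    n_candidates *= len(values)
print(n_candidates)      # -> 15360 candidate combinations
print(n_candidates * 5)  # -> 76800 model fits at cv=5
```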
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(colsample_bylevel=0.5, colsample_bytree=0.5,
eval_metric='logloss', learning_rate=0.01, n_estimators=30,
random_state=1, scale_pos_weight=5, subsample=0.9)
# Using above defined function to get accuracy, recall and precision on train and test set
xgb_tuned_score = get_metrics_score(xgb_tuned)
Accuracy on training set : 0.6994727592267135
Accuracy on test set : 0.6653005464480874
Recall on training set : 0.7741433021806854
Recall on test set : 0.7681159420289855
Precision on training set : 0.3606676342525399
Precision on test set : 0.3322884012539185
make_confusion_matrix(xgb_tuned, y_test)
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
'Passport' is the most important feature as per the XGBoost model, unlike AdaBoost and Gradient Boosting, where the most important feature is 'MonthlyIncome'.

# defining a dictionary of all fitted models, keyed by a display name
models = {
# -------- bagging models --------
"bagging": bagging, # bagging
"bagging weighted": bagging_wt, # bagging weighted
"decision tree": dtree, # decision tree
"random forest": rf, # random forest
"random forest weighted": rf_wt, # random forest weighted
"bagging tuned": bagging_estimator, # bagging tuned
"decision tree tuned": dtree_estimator, # decision tree tuned
"random forest tuned": rf_estimator, # random forest tuned
# -------- boosting models --------
"adaboost with default parameters": abc, # adaboost with default parameters
"adaboost tuned": abc_tuned, # adaboost tuned
"gradient boosting with default parameters": gbc, # gradient boosting with default parameters
"gradient boosting with init=adaboost": gbc_init, # gradient boosting with init=AdaBoost
"gradient boosting tuned": gbc_tuned, # gradient boosting tuned
"xgboost with default parameters": xgb, # xgboost with default parameters
"xgboost tuned": xgb_tuned, # xgboost tuned
}
# dataframe consolidating all models' metrics for `'bagging'` and `'boosting'` on train and test
df_models = pd.DataFrame()
for model_id, model in models.items():
df_concat = pd.DataFrame()
for split in ["train", "test"]:
if split == "train":
df_train = np.round(
model_performance_classification_sklearn(model, X_train, y_train), 2
)
df_train = pd.concat(
[pd.DataFrame([split], index=[0], columns=["split"]), df_train],
axis=1,
)
else:
df_test = np.round(
model_performance_classification_sklearn(model, X_test, y_test), 2
)
df_test = pd.concat(
[pd.DataFrame([split], index=[0], columns=["split"]), df_test], axis=1
)
# concatenated training and test results for `'model_id'` models
df_concat = pd.concat([df_train, df_test], axis=1)
df_concat.index = [model_id]
df_models = pd.concat([df_models, df_concat], axis=0)
df_models
| Model | split | Accuracy | Recall | Precision | F1 | split | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| bagging | train | 0.99 | 0.96 | 1.00 | 0.98 | test | 0.89 | 0.53 | 0.86 | 0.65 |
| bagging weighted | train | 0.99 | 0.97 | 1.00 | 0.98 | test | 0.90 | 0.62 | 0.82 | 0.70 |
| decision tree | train | 1.00 | 1.00 | 1.00 | 1.00 | test | 0.89 | 0.72 | 0.70 | 0.71 |
| random forest | train | 1.00 | 1.00 | 1.00 | 1.00 | test | 0.91 | 0.57 | 0.92 | 0.71 |
| random forest weighted | train | 1.00 | 1.00 | 1.00 | 1.00 | test | 0.91 | 0.60 | 0.92 | 0.72 |
| bagging tuned | train | 1.00 | 1.00 | 1.00 | 1.00 | test | 0.91 | 0.63 | 0.88 | 0.74 |
| decision tree tuned | train | 0.99 | 0.94 | 1.00 | 0.97 | test | 0.88 | 0.65 | 0.70 | 0.67 |
| random forest tuned | train | 1.00 | 1.00 | 1.00 | 1.00 | test | 0.92 | 0.65 | 0.90 | 0.75 |
| adaboost with default parameters | train | 0.85 | 0.34 | 0.68 | 0.46 | test | 0.84 | 0.32 | 0.63 | 0.43 |
| adaboost tuned | train | 0.99 | 0.94 | 0.98 | 0.96 | test | 0.84 | 0.54 | 0.60 | 0.57 |
| gradient boosting with default parameters | train | 0.89 | 0.47 | 0.87 | 0.61 | test | 0.85 | 0.37 | 0.72 | 0.49 |
| gradient boosting with init=adaboost | train | 0.89 | 0.46 | 0.87 | 0.61 | test | 0.85 | 0.37 | 0.73 | 0.49 |
| gradient boosting tuned | train | 0.92 | 0.62 | 0.94 | 0.75 | test | 0.87 | 0.45 | 0.76 | 0.56 |
| xgboost with default parameters | train | 0.88 | 0.43 | 0.86 | 0.58 | test | 0.86 | 0.36 | 0.76 | 0.48 |
| xgboost tuned | train | 0.70 | 0.77 | 0.36 | 0.49 | test | 0.67 | 0.77 | 0.33 | 0.46 |
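The table can also be queried programmatically; a sketch with a few of the test numbers copied from the rows above, picking the model with the highest test Recall:

```python
# Sketch: ranking models by test Recall (numbers copied from the table above)
import pandas as pd

summary = pd.DataFrame(
    {
        "Accuracy_test": [0.89, 0.92, 0.67],
        "Recall_test": [0.72, 0.65, 0.77],
    },
    index=["decision tree", "random forest tuned", "xgboost tuned"],
)
best_by_recall = summary["Recall_test"].idxmax()
print(best_by_recall)  # -> xgboost tuned
```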
A cost function quantifies the error between predicted and expected values and presents it as a single real number. "Visit for Us"'s main aim would be to balance the trade-off between losing a sales opportunity in the case of a false negative (a potential buyer predicted as a non-buyer) and wasting marketing spend in the case of a false positive (a non-buyer targeted as a buyer).

'Recall' is the metric of interest here, and we tuned our models on 'Recall'. But this does not mean the other metrics should be ignored completely: with a very low Precision, "Visit for Us" would actually be losing money on wasted outreach in the longer run.

From the consolidated table of results, the highest test 'Accuracy' is shared by random forest, random forest weighted, and bagging tuned at 0.91, with random forest tuned slightly ahead at 0.92.
The highest test 'Recall' values are from the decision tree at 0.72 and the tuned xgboost model at 0.77; their respective training values are 1.00 and 0.77.
Although these values look a little low for the problem at hand, they are the more stable results among these two models, and the tuned xgboost model, with no train-test gap in Recall, generalizes better than the decision tree.
On variable importance, 'Passport' is the most important feature in the tuned xgboost model, followed by 'MaritalStatus', 'Age', and 'MonthlyIncome'. In the tuned gradient boosting model, the most important feature is 'MonthlyIncome', followed by 'Age', 'Passport', and 'DurationOfPitch'. In the tuned AdaBoost model, the order of importance, from higher to lower, is 'MonthlyIncome', 'DurationOfPitch', 'Age', and 'NumberOfTrips'.
We were able to build a tuned xgboost model that generalizes well, by optimizing the Recall metric with cross-validation.
We recommend focusing on customers meeting the criteria around 'Passport', 'MaritalStatus', 'Age', and 'MonthlyIncome' suggested by the tuned xgboost model. Concentrating on these features supports the model's results, which keep the number of false negatives low.
# !jupyter nbconvert --to html --template full Project4_Travel_Package_Purchase_Prediction.ipynb